Weighted Ensemble Clustering for Increasing the Accuracy of the Final Clustering


Abstract:

Clustering results depend heavily on factors such as the number of clusters, the choice of clustering algorithm, and the distance measure. Ensemble clustering, inspired by ensemble classification, is one approach to reducing the effect of these factors on the final clustering. Since weighting the base classifiers has been a successful idea in ensemble classification, in this paper we propose a method for weighting the base clusterings in ensemble clustering. The accuracies of the base clusterings are estimated using an algorithm from the crowdsourcing literature called the agreement/disagreement (AD) method. This method exploits the agreements and disagreements between labelers to estimate their accuracies. It assumes that several labelers have labeled a set of samples, so every pair of labelers has an agreement ratio over the samples they both labeled. Under some independence assumptions, the agreement ratio between two labelers has a closed-form expression in terms of their accuracies. The AD method estimates the labelers' accuracies by minimizing the difference between this parametric agreement ratio and the agreement ratio observed in the labels. To adapt the AD method to clustering, an agreement between two clusterings is defined as holding the same opinion about a pair of samples: both clusterings place the two samples in the same cluster, or both place them in different clusters. An optimization problem is then solved to obtain accuracies for the base clusterings such that the difference between their observed agreement ratios and the agreement ratios expected from their accuracies is minimized.
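The pairwise agreement definition and the AD-style fit described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation. The abstract does not give the closed-form formula, so the sketch assumes the common binary form in which two independent labelers with accuracies p_i and p_j agree with probability p_i p_j + (1 - p_i)(1 - p_j), where each "label" is the same-cluster or different-cluster decision for a pair of samples. The function names, the squared-error loss, the projected gradient descent solver, and the restriction of accuracies to [0.5, 1] are all illustrative choices.

```python
from itertools import combinations


def pair_agreement_ratio(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree.

    Two clusterings agree on a pair when both put the two samples in the
    same cluster or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("clusterings must label the same samples")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += int(same_a == same_b)
        total += 1
    return agree / total


def expected_agreement(p_i, p_j):
    """ASSUMED closed form: both decisions right, or both wrong."""
    return p_i * p_j + (1.0 - p_i) * (1.0 - p_j)


def estimate_accuracies(agreement, steps=5000, lr=0.05):
    """Fit accuracies so expected agreements match the observed ones.

    `agreement[i][j]` is the observed agreement ratio between clusterings
    i and j (the diagonal is ignored). Minimizes the sum of squared
    differences with projected gradient descent, keeping each accuracy
    in [0.5, 1]; this solver is an illustrative assumption.
    """
    m = len(agreement)
    p = [0.75] * m
    for _ in range(steps):
        grad = [0.0] * m
        for i in range(m):
            for j in range(m):
                if i == j:
                    continue
                resid = expected_agreement(p[i], p[j]) - agreement[i][j]
                # d/dp_i of the expected agreement is (2 * p_j - 1)
                grad[i] += 2.0 * resid * (2.0 * p[j] - 1.0)
        p = [min(1.0, max(0.5, p[i] - lr * grad[i])) for i in range(m)]
    return p


def agreement_matrix(clusterings):
    """Observed pairwise agreement ratios for a list of label lists."""
    m = len(clusterings)
    return [[pair_agreement_ratio(clusterings[i], clusterings[j])
             for j in range(m)] for i in range(m)]
```

With only two clusterings there is a single agreement ratio for two unknown accuracies, so the fit is only meaningful for three or more base clusterings.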
To generate the base clusterings, we use four settings: different clustering algorithms, different distance measures, distributed features, and different numbers of clusters. The clustering algorithms are mean shift, k-means, mini-batch k-means, affinity propagation, DBSCAN, spectral clustering, BIRCH, and agglomerative clustering with average and Ward linkage. The distance measures are correlation, city block, cosine, and Euclidean. In the distributed-features setting, k-means is run on 40%, 50%, …, and 100% of the features, selected at random. Finally, for different numbers of clusters, we run k-means with k equal to 2 and to 50%, 75%, 100%, 150%, and 200% of the true number of clusters. We add the weights estimated by the AD algorithm to two well-known ensemble clustering methods, the Cluster-based Similarity Partitioning Algorithm (CSPA) and the HyperGraph Partitioning Algorithm (HGPA). In CSPA, the similarity matrix is computed as a weighted average of the opinions of the base clusterings. In HGPA, we propose weighting the hyperedges by different values such as the estimated clustering accuracies, the cluster sizes, and the silhouette of the clusterings. The experiments are performed on 13 real and artificial datasets. The reported evaluation measures are the adjusted Rand index, Fowlkes-Mallows index, mutual information, adjusted mutual information, normalized mutual information, homogeneity, completeness, V-measure, and purity. The results show that in the majority of cases the proposed weighted method outperforms unweighted ensemble clustering. In addition, weighting improves HGPA more than it improves CSPA. Among the weighting methods proposed for HGPA, the best average results are obtained when the hyperedges are weighted by the accuracies estimated with the AD method, and the worst when they are weighted by the normalized silhouette measure.
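The weighting steps for CSPA and HGPA can be illustrated with a small pure-Python sketch. It assumes each base clustering is a list of labels and that the weights are already available (for example, accuracies estimated by the AD method); the function names and data layout are hypothetical. The sketch only builds the weighted similarity matrix and the weighted hyperedges. The graph and hypergraph partitioning steps that CSPA and HGPA then apply are omitted.

```python
def weighted_coassociation(clusterings, weights):
    """Weighted-average similarity matrix for a CSPA-style ensemble.

    Entry [i][j] is the weight-normalized share of base clusterings that
    put samples i and j in the same cluster.
    """
    if len(clusterings) != len(weights):
        raise ValueError("need exactly one weight per base clustering")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(clusterings[0])
    sim = [[0.0] * n for _ in range(n)]
    for labels, w in zip(clusterings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    sim[i][j] += w / total
    return sim


def weighted_hyperedges(clusterings, weights):
    """Hyperedges for an HGPA-style ensemble, one per cluster.

    Each cluster of each base clustering becomes a hyperedge (a tuple of
    sample indices) that carries the weight of its clustering. Weighting
    by cluster size or silhouette would only change the weight value.
    """
    if len(clusterings) != len(weights):
        raise ValueError("need exactly one weight per base clustering")
    edges = []
    for labels, w in zip(clusterings, weights):
        members = {}
        for idx, lab in enumerate(labels):
            members.setdefault(lab, []).append(idx)
        for idxs in members.values():
            edges.append((tuple(idxs), w))
    return edges


# Hypothetical toy input: two base clusterings of four samples.
base = [[0, 0, 1, 1], [0, 1, 1, 1]]
w = [3.0, 1.0]
similarity = weighted_coassociation(base, w)
hyperedges = weighted_hyperedges(base, w)
```

Normalizing by the sum of the weights keeps every similarity in [0, 1], so the matrix stays comparable to the unweighted case, where all weights are equal.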
Finally, among the methods for generating base clusterings, weighted HGPA achieves its best results when the base clusterings are produced by different clustering algorithms.


Similar resources

The ensemble clustering with maximize diversity using evolutionary optimization algorithms

Data clustering is one of the main steps in data mining and is responsible for exploring hidden patterns in unlabeled data. Due to the complexity of the problem and the weakness of the basic clustering methods, most studies today turn to clustering ensemble methods. Diversity in the primary results is one of the most important factors that can affect the quality of the final results. Also...


The Clustering and Classification Data Mining Techniques in Insurance Fraud Detection: The Case of Iranian Car Insurance

Given the ever-increasing spread of fraud in the insurance sector, especially in automobile insurance, and its negative consequences for insurance companies, applying suitable and efficient methods to identify and detect fraud in this area is essential. Understanding the patterns in data on previously reported claims can be useful in determining whether a damage claim is genuine or not. One of the most common and widely used ways of discovering patterns in data is the use of ...

A new ensemble clustering method based on fuzzy cmeans clustering while maintaining diversity in ensemble

Ensemble clustering has been considered one of the research approaches in data mining, pattern recognition, machine learning and artificial intelligence over the last decade. In clustering, the combination first produces several base clusterings, and then a function is used to aggregate them into a final clustering that is as similar as possible to all the base clusterings. The inp...


Improving Accuracy in Intrusion Detection Systems Using Classifier Ensemble and Clustering

Recently, with the development of technology, the number of network-based services is increasing, and sensitive information of users is shared through the Internet. Accordingly, large-scale malicious attacks on computer networks could cause severe disruption to network services, so cybersecurity has become a major concern for networks. An intrusion detection system (IDS) could be cons...


Improving the Accuracy of Ontology Alignment through Ensemble Fuzzy Clustering

Automatic ontology alignment tools perform matching between the concepts of two ontologies and provide a similarity measure for each pair of aligned concepts. However, none of the existing tools are perfect and multiple alignment tools produce varying similarity measures for a certain alignment. Also, the similarity measures provided by an alignment may not be helpful enough for indicating the ...


Bilateral Weighted Fuzzy C-Means Clustering

Nowadays, the Fuzzy C-Means method has become one of the most popular clustering methods based on minimization of a criterion function. However, the performance of this clustering algorithm may be significantly degraded in the presence of noise. This paper presents a robust clustering algorithm called Bilateral Weighted Fuzzy C-Means (BWFCM). We used a new objective function that uses some k...




Journal

Volume 17, Issue 2

Pages 85-100

Publication date: 2020-09


